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ABSTRACT 

Online Social Networks (OSNs) have exploded in terms of 
scale and scope over the last few years. The unprecedented 
growth of these networks present challenges in terms of sys- 
tem design and maintenance. One way to cope with this is 
by partitioning such large networks and assigning these par- 
titions to different machines. However, social networks pos- 
sess unique properties that make the partitioning problem 
non-trivial. The main contribution of this paper is to under- 
stand different properties of social networks and how these 
properties can guide the choice of a partitioning algorithm. 
Using large scale measurements representing real OSNs, we 
first characterize different properties of social networks, and 
then we evaluate qualitatively different partitioning methods 
that cover the design space. We expose different trade-offs 
involved and understand them in light of properties of social 
networks. We show that a judicious choice of a partitioning 
scheme can help improve performance. 

1. INTRODUCTION 

The last few years have seen wide-scale proliferation 
of Online social networks (OSNs). According to a recent 
study, OSNs have become more popular than email [16] 
and popular OSNs like Facebook, MySpace, Orkut etc. 
have tens of millions of active users, with more users 
being added by the day. Twitter, for instance, grew a 
remarkable 1382% in one month(Feb-Mar 09) [17]. The 
growth of OSNs pose unique challenges in terms of scal- 
ing, management and maintenance of these networks 
[8]. In addition, online services are moving towards an 
increasingly distributed cloud paradigm [13, 6, 4], that 
bring forth their own challenges since smaller and geo- 
graphically spread data centers [6] will increase network 
costs. 

One obvious solution to deal with scaling issues would 
be partition the social network graph and assign par- 
titions to different servers. However, given the unique 
properties of social network graphs (presence of clus- 
ters or communities, skewed degree distributions, cor- 
relations to geography), it is not entirely clear how to 
partition such graphs, what properties of the underlying 
social network graph need to be taken into account and 
what are the trade-offs involved in partitioning. Under- 
standing the design space and the trade-offs can help in 
making informed choices. 



The main contribution of this paper is to provide an 
understanding of the problem of social network par- 
titioning by relying on measurements taken from two 
large and popular social networks: Twitter and Orkut. 
We state the problem of social network partitioning 
along with the performance objectives, highlighting the 
trade-offs involved. We then discuss the different char- 
acteristics of social network graphs that can impact de- 
sign choices of partitioning algorithms. In particular, 
using the data we obtain for the Twitter network, we 
measure strong geographical locality as well as heteroge- 
nous traffic patterns between users. 

Based on the performance objectives and social net- 
work characteristics, we select the following partitioning 
algorithms to evaluate against our data-sets. We first 
choose METIS [10] that is drawn from traditional graph 
partitioning algorithms [9, 10, 23]. This is an intuitive 
choice as these algorithms are designed to produce par- 
titions while reducing traffic across partitions and bal- 
ancing the number of nodes across partitions. Since it 
is widely known that social networks consist of 'com- 
munity' structure [14, 21], the next method we eval- 
uate relies on extracting such communities, and these 
communities can be used as partitions. However cer- 
tain issues with such algorithms, in particular arbitrary 
number of communities and skewed partition sizes, lead 
us to augment an existing community detection scheme 
to deal with these issues. We evaluate these algorithms 
against different performance metrics to expose various 
trade-offs, in particular the trade-off between balancing 
load across different partitions and reducing traffic be- 
tween partitions. After evaluating different algorithms 
against our datasets, we find that the augmented com- 
munity detection that we devised does well in terms of 
balancing the trade-offs involved and we interpret these 
results in light of properties that social networks pos- 
sess. 

2. RELATED WORK 

Graph partitioning has been proposed to deal with 
scaling issues of social network systems [8, 7], as well 
as handHng large data-sets [27]. However, to the best 
of our knowledge, there is no study of different proper- 
ties of social network graphs, how these properties can 
impact the possible different choices of partitioning al- 
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gorithms and how different partitioning algorithms per- 
form on real data, along with quantifying the different 
tradeoffs involved. We aim to address these concerns in 
this paper. 

There has been a concerted effort to characterize and 
understand Online Social Networks over the last few 
years [18, 3, 25, 12, 19], including flow of information 
[11], existence of social communities [14] as well as evo- 
lution of such networks [18, 1]. As such, we arc more 
focused on understanding the aspects of social networks 
that can impact partitioning. Towards that end, we rely 
on certain results (like existence of social communities 
[14, 20]) that have been reported in the past and also 
report existence of certain properties (like geographi- 
cal locality and heterogenous traffic distribution) in one 
network we study; Twitter [12] that can better inform 
system architects interested in social networks, as well 
as design of partitioning schemes. 

In order to understand the trade-offs involved, we rely 
on different partitioning schemes that span a range of 
design choices. We first focus on a method drawn from 
classical graph partitioning algorithms [9, 10, 23, 24] 
that rely on flnding minimum edge cuts to seperate the 
graph into roughly equal-sized clusters. We also draw 
from work done that aims to extract 'communities' that 
are socially relevant [21, 2]. However these algorithms 
do not give a desired size of communities, nor do they 
give equal sized partitions. We present a scheme in this 
paper that yields a desired number of communities, of 
equal size. 

3. METHODOLOGY 

In this section, we pose the problem of social network 
partitioning, along with the objectives. We then explore 
properties of social network graphs that could impact 
the design and choice of different partitioning schemes. 
We end this section with a brief description of some of 
the methods wc explore in this study. We first describe 
the data-sets that we collect and use. 

3.1 Data 

In order to carefully understand the different charac- 
teristics of social network graphs that can impact parti- 
tioning, we rely on data. There exist data-sets of large 
social networks that are publicly available [19]. How- 
ever, we were looking for data-sets that included lo- 
cation information, as well as information about traffic 
that is exchanged between users. Since we are not aware 
of any publicly available data-sets, we collected our own 
data-set by crawling Twitter (http://www.twitter.com). 
Twitter is a microblogging site that has become very 
popular of late. More details about Twitter, along 
with information on how links are formed, can be found 
in [12]. We collected data from Twitter between Nov 
25 - Dec 4, 2008 and collected information comprising 
2, 408, 534 nodes and 38, 381, 566 edges (ahhough Twit- 
ter has directed edges, we report total edges and use 
edges as undirected unless otherwise stated). 

In addition, we also collect location information, as 



self-identified by users. This can be in the form of free- 
text or by latitude/longitude. In order to glean loca- 
tions, we had to filter the free-text for junk informa- 
tion. After basic preprocessing we obtained 187K dif- 
ferent locations for the 996K users where the location 
text was meaningful. We then used the Yahoo! Maps 
Web Services^ that offers the Yahoo Maps functional- 
ity as an API to standardize locations. We could test 
the 187K different locations to obtain the country, and 
where available, the state and the city. We finally ob- 
tained the standardized locations for 691K users. 

Wc also collected traffic on Twitter in the form of 
'tweets' (total 12M tweets) between the 2.4M users by 
using the Twitter API^. From the tweets, we mined the 
following relevant fields: tweet-id, timestamp, userJd, 
location and content. This information allows us to get 
traffic information for every user and how this traffic is 
distributed across users (this gives us volume of social 
conversations). Approximately 25% of the population 
(587K) generated at least one tweet, for the rest of the 
users the Twitter API did not return tweets in the 19- 
days period under examination. Fig. 1(a) and (b) show 
the skewed distribution of the number of links per user 
as well as the volume of traffic sent per user. We intend 
to make this data public. 

We also use graphs representing other social networks, 
including Orkut, that can be taken to be closer to an ac- 
tual 'social' network, as edges in such a graph will prob- 
ably represent actual social relationships. This dataset 
consists of 3, 072, 441 nodes and 223, 534, 301 edges and 
this data was collected between Oct 3 - Nov 11, 2006. 
More information can be found in [19]. We present re- 
sults from Twitter and Orkut in this paper for space 
reasons, although our analysis on other graphs [19] have 
borne similar results. 

3.2 Social Network Partitioning 

We represent a social network as a graph G = (V,E), 
where the edges are undirected. The problem then is 
to partition V into k subsets or clusters, (Vi, V2, .., V^), 
k > I such that Vi n = when i ^ j and UiVi = V. 
The objective function to decide how to obtain these k 
partitions differ. 

Performance Objectives The first objective fimc- 
tion for a partitioning scheme would be to reduce the 
edges that traverse between partitions. We refer to 
these edges as external or inter-edges. This metric rep- 
resents the amount of traffic that could traverse between 
partitions and reducing this metric therefore, has a di- 
rect bearing on reducing bandwidth costs. Bandwidth 
costs can be a bigger concern if a social network graph 
is hosted across geographically distributed data-centers 
[13]. 

The second objective function is related to keeping 
utilization high and balanced across servers hosting dif- 
ferent partitions. Ideally, all partitions should be of 

http://developer.yahoo.com/maps/rest/Vl/geocode.html 
^http: / / apiwiki.twitter.com 
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Figure 1: Twitter Characteristics (a) CDF of Followers (b) CDF of 'Tweets '/Messages per user (c) 
'Tweets' per hour 



equal size, leading to balanced utilization across ma- 
chines. It has been noted that given the costs of server 
installation and limited lifetime, keeping servers on and 
running optimizes work per investment dollar [6]. By 
assigning nodes to different servers (equivalent to as- 
signing partitions) in an informed manner can help achieve 
this goal. 

Balancing load across servers and reducing inter-server 
traffic are often at odds with each other and hence 
there is a need to study the design space of partition- 
ing schemes to characterize the tradeoffs involved. In 
addition, social networks have slightly different underly- 
ing characteristics compared to regular communication 
networks that can be exploited for the design of parti- 
tioning schemes. We discuss these characteristics in the 
next section. 

3.3 Properties of Social Networks 

Properties that can be relevant to the problem of par- 
titioning are: 

Structural Properties: Social networks are known 
to have skewed degree distributions, assortativity [20], 
small world properties [26] as well as strong clustering 
or community structure [14]. Partitioning by uncover- 
ing communities in a social network can be intuitively 
appealing as volume of intra-community interaction is 
much higher than inter-community interaction. 

Geo-Locality Properties: Social networks have been 
shown to have strong correlations to geography; the 
probability of new links forming is correlated to dis- 
tance [15]. We observe similar strong correlations to 
distance in the Twitter data-set. In Table 1, we present 
locality results of the top 5 countries and US states by 
size. We consider directed links. The second column 
shows the size distribution of different countries; for in- 
stance US has 60.2% of the total nodes. We also found 
the percentage of edges are closely correlated to the 
size of the network, for instance 66% of the total inlink 
edges and 65.01% of the total outlink edges belong to 
the US. The next two columns represent the percentage 
of inlinks and outlinks that come from users of the same 
country. For instance, 80% of inlinks in US come from 
nodes belonging to US, and 81% of outlinks in US also 



belong to nodes within US. If we compare these num- 
bers to the relative sizes of the partitions, we realize that 
there is high geo-locality; for a partition the percentage 
of links that are local is higher than what would be the 
case if connectivity is purely random; for instance in the 
case of US this would be 60.2% Factoring this locality 
property could yield performance benefits, specially if 
partitions are assigned to servers distributed geograph- 
ically [6, 4]. Strong locality can thus reduce inter-server 
traffic considerably. 

In addition, relying on semantic information like geo- 
graphical location for partitioning can be appealing due 
to the simplicity as one does not need global information 
about the structure of the social network, and issues like 
churn (users joining/leaving the social network) can be 
handled more effectively and efficiently. We do not pro- 
pose a scheme based on geographical locality in this 
paper and leave it for future work. However, we note 
that the community structure embeds some information 
about locality [15]. 

Traffic Properties: Traffic on social networks pri- 
marily consists of messages, user-generated content (UGC) 
and status updates exchanged between nodes. From our 
analysis of traffic on Twitter, we observed some simi- 
larities to traffic on IP networks like diurnal patterns 
(Fig. 1(c)). However there are a few facets that makes 
traffic on social networks unique, and hence worth pay- 
ing special attention to. First of all, traffic in social 
networks is inherently local; most traffic is confined to 
one-hop distance. Secondly, traffic in social networks 
can have a multiplicative effect due to the broadcast na- 
ture of certain messaging protocols like status updates. 
To illustrate this point, consider Figs.l (a, b). These 
figures show the skewed nature of the number of follow- 
ers (social contacts) a user in Twitter has, as well as 
the skewed nature of the number of messages (tweets) 
sent per user. Given the broadcast nature of status up- 
dates in Twitter, the 12M unique tweets, coupled with 
the skewed distribution of the number of followers, the 
actual number of tweets are in excess of 1.7B messages. 

In addition, we also extracted conversations between 
users using the content in tweets; we establish a link 
between user i and j iif i-id appears in the content of 
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Country 


Size (%) 


% of inlinks 


% of outlinks 




State 


Size (%) 


% of inlinks 


% of outlinks 


US 


60.2 


79.9 


81.0 




CA 


16.9 


32 


37 


GB 


6.3 


32.9 


31.8 




NY 


8.3 


22 


25 


CA 


4.02 


26.7 


25.2 




TX 


7.2 


28 


28 


BR 


3.0 


65 


62.2 




FL 


4.6 


19 


17.6 


JP 


2.9 


80 


83 




IL 


4.1 


20 


20 



Table 1: Presence of Strong GeoLocality (Countries and US States): Twitter 



at least one tweet of j and viec- versa. We obtain 265K 
links betwc;en people who are maintaining a conversa- 
tion, which is clearly less than the 38M links between 
people according to the followers SN. 

This information captures the social network at a 
finer level than the social network given by a simple 
contact list; people have users in ones' contact list that 
they may not actually know [3] . Conversations imply a 
more deeper social relation, therefore they can be used 
to validate the accuracy of the partition algorithms in 
preserving such strong social relations. 



Algorithm 1 M0+ 

Input: G = {V, E) Social network graph 

Input: k Number of final partitions 

Input: Modularity optimization algorithm 

Output: P Partition set, Pi set of nodes of partition 

i, \P\=k 

U^{{l..\V\]} 

while U ^%do 

G' ^ {V\vi ^ Ui, E\eij hi ^ C/i V Vj ^ Ui) 

C T (G") — partition algorithm 

C" ^ C I \ci\> \ci+\ \ — sort the best partition by 

size 

for Cj € C do 
if |ci| > then 

[/ ^ [/ U {ci\ 
else 

for "pj e P do 

if Ci7^0A(|pj| + |ci| then 
Pj <— pj U Ci 

end if 
end for 
end if 
end for 

u ^ c/\{c/i} 

end while 



3.4 Partitioning Methods 

In order to explore the design space for partitioning 
algorithms and quantify the tradeoffs involved, we study 
the following methods: 

Graph Partitioning (GP) Algorithms: The ob- 
jective function that these algorithms rely on is to re- 
duce inter-cluster edges; rciduce the number of edges 
incident on vertices belonging to different clusters. In 
addition, the partitions should be balanced; have simi- 



lar sizes. Hence such algorithms are a natural candidate 
to study as the performance objectives for our problem 
are addressed by such algorithms. Spectral partitioning 
algorithms rely on finding the minimum cut to ensure 
the minimum number of edges cross between partitions 
[9] . Many different variants have been proposed and we 
rely on the Multi-way partitioning method, also called 
METIS[10] that has been shown to produce very high 
quality clusters in a fast and efficient manner. 

Modularity Optimization (MO) Algorithms: So- 
cial graphs with communities are intuitively different 
from random graphs in terms of structure [20] . This dif- 
ference can be captured via a metric called modularity 
{Q) that is defined as the difference between the number 
of edges within communities and the expected number 
of such edges. Mathematically, Q ~ Aij — Pij, where 
Aij is the actual number of edges traversing two com- 
munities, and Pij ~ di*dj, where di is degree of node i. 
The modularity metric is between (0, 1) and lower val- 
ues signify a structure closer to random graphs, while 
higher values signify strong community structure. The 
algorithms developed for community detection therefore 
rely on finding partitions that maximize this metric. For 
the purposes of this study, the modularity optimization 
scheme we use is proposed in [2] , that suits our purpose 
and is capable of handling large graphs, very quickly. 
However, there are a couple of caveats: the commu- 
nities can be unequal in size - this is closer to reality 
as real-life communities arc seldom of equal size. The 
other problem is that there can be an arbitrary num- 
ber of communities, as the objective of such algorithms 
is to find the natural number of communities, rather 
than find a predefined number. When we ran the al- 
gorithm on our datasets, we got 2743 and 37 commu- 
nities for Twitter (modularity=0.48) and Orkut (mod- 
ularity =0.63) respectively, with highly skewed commu- 
nity sizes (size of largest community were 21% and 25% 
of the total number of nodes), making these scheme not 
amenable for using as-is. 

Modularity Optimization Enhanced (MO-|-): 
In order to obtain a predefined and fixed number of 
communities, each community of a fixed size, we modify 
the results obtained by a modularity optimizing algo- 
rithm. In order to obtain a predefined number of parti- 
tions of equal size we devised the algorithm M0+ 1 that 
can be applied as a post-processing step to the results 
of a modularity optimization algorithm MO. At a high 
level, we start grouping the communities in a partition 
(with a predefined size) sequentially until the partition 
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Figure 2: Partitioning Results: %of Intra-Partition Edges vs # of Partitions (a) Twitter, (b) Orkut 
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Figure 3: Partitioning Results with Traffic: %of Intra-Partition Edges vs # of Partitions (a) Twitter, 
(b) Conversations 



is full, when we move to next available partition. In 
the case the community does not fit, we apply MO to 
the subset graph recursively. This scheme is extremely 
simple and naive, but for our purpose, it suffices. 

Random Partitioning: As a baseline scheme, we 
also consider partitioning randomly. Random parti- 
tioning has few advantages - in addition to being very 
simple to implement, randomization balances load op- 
timally. Hence if we are only concerned with balanc- 
ing load, then random partitioning will give the best 
results. However such a scheme may not do well in re- 
ducing inter-partition traffic. 
4. EXPERIMENTS 

The algorithms we try out include Random (R) , Graph 
Partitioning (GP), Modularity Optimization-I- (MO-I-) 
for Twitter and Orkut. GPw and MO+w stand for the 
weighted versions of GP and M0+, where the weight 
can represent traffic on edges. We use the average of 
the number of tweets of exchanged between users; in 
the case that edge weight is 0, we set the weight to I 
to avoid disconnecting the graph. We study the effect 
of weighted edges for Twitter only since we do not have 
access to the traffic of Orkut. 

4.1 Communication across Partitions 



Wc first evaluate how different partitioning schemes 
perform in terms of the number of a) edges, b) messages 
and c) conversations that exist within partitions. Note 
that higher the m^ra-partition, lower the network traffic 
between machines. 

Intra-Partition Edges: Fig. 2 depicts the percent- 
age of mtra-partition edges. First of all, this metric is 
low for the Random, and it decreases as where k is 
the number of partitions. The other schemes decrease at 
a much slower rate showing that they are able to main- 
tain the structure of the underlying social network. In 
network traffic terms. Random with 256 partitions re- 
sults in a mere 0.4% of intra-partition edges, while GP 
and MO-I- produce 10% to 30% of intra-partition edges 
for Twitter. For Orkut, the reduction on network traffic 
is even higher - both GP and M0-|- produce around 50% 
intra-partition edges. GP and MO-I- yield very similar 
results for Orkut and Twitter with small-size partitions. 
However, M0+ (and the weighted counterpart MO-|-w) 
seems to be more consistent across different number of 
partitions. 

Intra-Partition Messages: For the next set of re- 
sults, we use actual actual traffic, in the form of mes- 
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Figure 4: Distribution of Peak Load (a) 8 (b) 32 (c) 128 

sages exchanged (1.7B messages, as reported in Sec. 3.3) 



and run them against the partitions produced. Here we 
use directed edges. Note that this is quaUtatively differ- 
ent from merely counting the number of edges. Fig. 3 
(a) shows intra-partition messages and the results are 
qualitatively similar to results in Fig. 2. Again, M0+ 
gives better partitions than GP when the number of 
partitions increase and both of them perform better 
than Random. 

Intra-Partition Conversations: Conversations link 
people who are likely to have a stronger social relation- 
ship, and therefore, they are more susceptible to con- 
sume each others user generated content [5, 22], e.g. 
videos, pictures, etc. Thus, it can be extrapolated that 
these links will generate additional traffic besides mes- 
sages that we capture. The 265K conversational links 
also serve as a ground-truth. We expect the social net- 
work based partition schemes to be able to retrieve a 
majority of these strong links. Fig. 3(b) shows that the 
percentage of conversations is higher than the intra- 
partition edges; MO-I- retains more than 50% of these 
social links even when the number of partition is 256, 
stressing the importance of using schemes that are more 
aware of the underlying social structure. 

4.2 Partitions: Balancing Load 

We showed that partitions based on the network struc- 
ture(GP, MO-I-) can significantly reduce network traffic 
while compared to a Random. However, we also need 
to study the effect of these schemes on the load distri- 
bution across the machines that host these partitions. 
We extract the distribution of the arrival of tweets per 
minute from the trace we have. ^ If there was no parti- 
tioning, a single machine should be dimensioned (CPU, 
memory, I/O) to deal with an average of 448 write re- 
quests/min. Most systems rely on worst-case provision- 
ing. In order to perform worst-case analysis, we ex- 
tracted the 99.99% percentile traffic rate that came to 
be 951 writes/min. In other words, without partition- 



^Note that a tweet is a write request, it has to be stored 
and distributed. Read operations are more common but 
require much less resources since systems are designed for 
faster reads. Common practice is that the QoS of a read 
request should be < 200ms, while for a write it can be an 
order of magnitude higher. 



ing, this would be the peak traffic that a single machine 
would have to handle. 

In Fig. 4 we plot the CCDF of the write request 
(tweets arrival) per min across all partitions for different 
partition sizes (8,32,128 respectively). For 8 machines 
(Fig. 4 (a)), the 99.99% percentile peak of traffic is 136, 
171, 246 req/min for Random, MO-I- and GP respec- 
tively. After partitioning the machines only need to 
have resources to deal with this reduced load peak. For 
the partition in 128 machines (Fig. 4 (c)) the load peak 
is 21,35,28 req/min for Random, MO-I- and GP. 

The picture that emerges from our experiments is 
as follows. Schemes based on modularity optimization 
are good for reducing network traffic, and are compet- 
itive with Random in terms of load balancing across 
partitions. Hence while the tradeoff between reducing 
bandwidth costs and balancing load is still present, we 
find that MO based schemes are promising. We inter- 
pret these results in light of the arguments we made in 
Sec 3.3, more specifically the existence of social conver- 
sations, and the community structure. 

5. DISCUSSION 

Online social networks are increasing at a very high 
rate, putting a strain on existing infrastructure and de- 
manding new architectural solutions. As such systems 
get pushed to highly distributed cloud infrastructure, 
the problem of finding intelligent solutions to scaling 
issues is further exacerbated. 

In this paper we explored the problem of partition the 
users' space of different OSN using different schemes. 
By using data from real OSN - Twitter and Orkut - 
we measured the impact of different partition schemes 
on network traffic and peak load. We found that tradi- 
tional graph partitioning techniques can effectively be 
applied to OSNs to reduce the network overhead that 
a typical off-the-shelf random partition would generate 
while maintaining an acceptable peak load distribution. 
We also show that modularity optimization algorithms 
are even better suited to the social network partition 
problem given that they are based on finding 'commu- 
nity' structures in the network. Such a scheme, how- 
ever, presents the problem of producing arbitrary num- 
ber of communities and we propose a post-processing 
algorithm MO-I- to handle this issue. 
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